## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
In this exercise, I will explore a data set on wine quality. The objective is to explore which chemical properties influence the quality of red wines. I’ll start by exploring the data using the statistical program, R. As interesting relationships in the data are discovered, I’ll produce and refine plots to illustrate them.
There are 1599 observations of 13 numeric variables. X appears to be the unique identifier. Quality is an ordered, categorical, discrete variable. From the literature, this was on a 0-10 scale, and was rated by at least 3 wine experts. The values ranged only from 3 to 8, with a mean of 5.64 and median of 6. All other variables seem to be continuous quantities (with the exception of the variables with the .sulfur.dioxide suffix). From the variable descriptions, it appears that fixed.acidity ~ volatile.acidity and free.sulfur.dioxide ~ total.sulfur.dioxide may possibly be dependent, with one a subset of the other.
Wine$X <- as.factor(Wine$X)
table(Wine$quality)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Looking at wine quality, we find that the counts are roughly bell-shaped (quality is discrete, so this is only approximately normal), with most of the data concentrated at a quality of 5 or 6.
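As a quick arithmetic check, the counts in the table above can be turned into proportions and summary statistics directly. The sketch below rebuilds the quality scores from those counts; the names `counts` and `quality_values` are mine and are not part of the original analysis.

```r
# Counts per quality score, copied from the table() output above
counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Share of wines at each quality score
round(prop.table(counts), 3)

# Share of wines rated 5 or 6
share_5_or_6 <- sum(counts[c("5", "6")]) / sum(counts)

# Rebuild the raw scores to recover the mean and median from summary()
quality_values <- rep(as.integer(names(counts)), times = counts)
mean_quality <- mean(quality_values)
median_quality <- median(quality_values)
```

Roughly 82% of the wines sit at a quality of 5 or 6, which is worth keeping in mind when reading any plot that groups by quality.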
To explore better, we can create a new variable Rating, which classifies Wines into Good, Bad, and Average depending on their quality scores.
BAD is Quality less than 5. AVERAGE is Quality less than 7 but greater than or equal to 5. GOOD is any number above and including 7.
Wine$rating <- ifelse(Wine$quality < 5, 'bad', ifelse(
Wine$quality < 7, 'average', 'good'))
Wine$rating <- ordered(Wine$rating,
levels = c('bad', 'average', 'good'))
summary(Wine$rating)
## bad average good
## 63 1319 217
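The nested ifelse() above could also be written with cut(), which states the bin edges in one place. This is only an alternative sketch, not the code used for the results here; it rebuilds the quality scores from the table() counts so that it runs on its own.

```r
# Quality scores rebuilt from the table() counts shown earlier
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Intervals are right-closed by default:
# (-Inf, 4] -> bad, (4, 6] -> average, (6, Inf] -> good
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c('bad', 'average', 'good'),
              ordered_result = TRUE)

table(rating)
```

The counts per level should agree with the summary(Wine$rating) output above.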
It appears that density and pH are normally distributed, with a few outliers. Fixed and volatile acidity, sulfur dioxides, sulphates, and alcohol seem to be long-tailed. Qualitatively, residual sugar and chlorides have extreme outliers. Citric acid appears to have a large number of zero values. This might be a case of non-reporting.
ggplot(Wine,aes(x=fixed.acidity))+geom_histogram(fill='red')+scale_x_log10(breaks=4:15)+
xlab('Fixed Acidity')
ggplot(Wine) + geom_histogram(aes(x=volatile.acidity),fill='blue')+
scale_x_log10(breaks=seq(0.1,1,0.1))
ggplot(Wine) +
geom_histogram(aes(x=citric.acid),fill='green') +
scale_x_log10()
As the plot shows, citric acid does not look normally distributed even on a logarithmic scale. The transformation also sent 132 data points to negative infinity, which tells us that 132 values are 0, since log(0) is -Inf in R.
Let’s verify that.
print(length(subset(Wine, citric.acid == 0)$citric.acid))
## [1] 132
print(log(0))
## [1] -Inf
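To make the effect of the zeros concrete, here is a small sketch using only the first ten citric.acid values from the str() output. The shifted transform at the end is one possible workaround that I am assuming for illustration; it was not used in this analysis.

```r
# First ten citric.acid values, copied from the str() output
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_zero <- sum(citric == 0)                 # number of exact zeros
log_citric <- log10(citric)                # zeros become -Inf
n_dropped <- sum(!is.finite(log_citric))   # values a log axis cannot place

# One possible workaround (an assumption, not the original method):
# log10(x + 1) keeps zeros finite by mapping them to 0
shifted <- log10(citric + 1)
```

Because citric acid only ranges from 0 to 1 here, log10(x + 1) compresses the values very little, so it mainly serves to keep the zero rows on the plot rather than to normalise the distribution.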
p1 <- ggplot(data = Wine, aes(x = residual.sugar)) +
geom_histogram() +
scale_x_continuous(lim = c(0, quantile(Wine$residual.sugar, 0.95))) +
xlab('residual.sugar, 95th percentile truncated')
p2 <- p1 + scale_x_log10(breaks=1:9) + xlab('residual.sugar, log10')
grid.arrange(p1, p2, ncol=1)
p3 <- ggplot(data = Wine, aes(x = chlorides)) +
geom_histogram() +
scale_x_continuous(lim = c(0, quantile(Wine$chlorides, 0.95))) +
xlab('chlorides, 95th percentile truncated')
p4 <- p3 + scale_x_log10(breaks=seq(0.01,0.11,0.02)) + xlab('chlorides, log10')
grid.arrange(p3, p4, ncol=1)
p5 <- ggplot(data = Wine, aes(x = sulphates)) +
geom_histogram() +
scale_x_continuous(lim = c(0, quantile(Wine$sulphates, 0.95))) +
xlab('sulphates, 95th percentile truncated')
p6 <- p5 + scale_x_log10(breaks=seq(0.2,1,0.2)) + xlab('sulphates, log10')
grid.arrange(p5, p6, ncol=1)
rm(p1,p2,p3,p4,p5,p6)
While exploring the univariate histogram distributions, there did not appear to be any bimodal or multimodal distributions that would warrant sub-classification into categorical variables. I considered splitting residual.sugar into ‘sweet wine’ and ‘dry wine’, but a residual sugar of greater than 45 g/L (equivalently g/dm^3) is needed to classify as a sweet wine, and the maximum in this data set is only 15.5.
I instantiated an ordered factor rating, classifying each wine sample as ‘bad’, ‘average’, or ‘good’.
Upon further examination of the data set documentation, it appears that fixed.acidity and volatile.acidity are different types of acids; tartaric acid and acetic acid. I decided to create a combined variable, TAC.acidity, containing the sum of tartaric, acetic, and citric acid.
Wine$TAC.acidity <- Wine$fixed.acidity + Wine$volatile.acidity +
Wine$citric.acid
qplot(Wine$TAC.acidity,main = 'Histogram of TAC Acidity (Tartaric+Acetic+Citric)')
get_simple_boxplot <- function(column, ylab) {
return(qplot(data = Wine, x = 'simple',
y = column, geom = 'boxplot',
xlab = '',
ylab = ylab))
}
grid.arrange(get_simple_boxplot(Wine$fixed.acidity, 'fixed acidity'),
get_simple_boxplot(Wine$volatile.acidity, 'volatile acidity'),
get_simple_boxplot(Wine$citric.acid, 'citric acid'),
get_simple_boxplot(Wine$TAC.acidity, 'TAC acidity'),
get_simple_boxplot(Wine$residual.sugar, 'residual sugar'),
get_simple_boxplot(Wine$chlorides, 'chlorides'),
get_simple_boxplot(Wine$free.sulfur.dioxide, 'free sulf. dioxide'),
get_simple_boxplot(Wine$total.sulfur.dioxide, 'total sulf. dioxide'),
get_simple_boxplot(Wine$density, 'density'),
get_simple_boxplot(Wine$pH, 'pH'),
get_simple_boxplot(Wine$sulphates, 'sulphates'),
get_simple_boxplot(Wine$alcohol, 'alcohol'),
ncol = 4)
In univariate analysis, I chose not to tidy or adjust any data, short of plotting a select few on logarithmic scales. Bivariate boxplots, with X as rating or quality, will be more interesting in showing trends with wine quality.
From exploring these plots, it seems that a ‘good’ wine generally shows these trends: higher fixed acidity (tartaric acid) and citric acid, with lower volatile acidity (acetic acid); lower pH (i.e. more acidic); higher sulphates; higher alcohol; and, to a lesser extent, lower chlorides and lower density.
Residual sugar and sulfur dioxides did not seem to have a dramatic impact on the quality or rating of the wines. Interestingly, it appears that different types of acid affect wine quality differently; as such, TAC.acidity saw an attenuated trend, as the presence of volatile (acetic) acid accompanied decreased quality.
By utilizing cor.test, I calculated the correlation for each of these variables against quality:
simple_cor_test <- function(x, y) {
return(cor.test(x, as.numeric(y))$estimate)
}
correlations <- c(
simple_cor_test(Wine$fixed.acidity, Wine$quality),
simple_cor_test(Wine$volatile.acidity, Wine$quality),
simple_cor_test(Wine$citric.acid, Wine$quality),
simple_cor_test(Wine$TAC.acidity, Wine$quality),
simple_cor_test(log10(Wine$residual.sugar), Wine$quality),
simple_cor_test(log10(Wine$chlorides), Wine$quality),
simple_cor_test(Wine$free.sulfur.dioxide, Wine$quality),
simple_cor_test(Wine$total.sulfur.dioxide, Wine$quality),
simple_cor_test(Wine$density, Wine$quality),
simple_cor_test(Wine$pH, Wine$quality),
simple_cor_test(log10(Wine$sulphates), Wine$quality),
simple_cor_test(Wine$alcohol, Wine$quality))
names(correlations) <- c('fixed.acidity', 'volatile.acidity', 'citric.acid',
'TAC.acidity', 'log10.residual.sugar',
'log10.chlorides', 'free.sulfur.dioxide',
'total.sulfur.dioxide', 'density', 'pH',
'log10.sulphates', 'alcohol')
correlations
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## TAC.acidity log10.residual.sugar log10.chlorides
## 0.10375373 0.02353331 -0.17613996
## free.sulfur.dioxide total.sulfur.dioxide density
## -0.05065606 -0.18510029 -0.17491923
## pH log10.sulphates alcohol
## -0.05773139 0.30864193 0.47616632
As we see, the four variables most strongly correlated with quality (by absolute value) are: alcohol, volatile acidity, sulphates (log10), and citric acid.
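The repeated simple_cor_test() calls could be collapsed with sapply(). Since quality is ordinal, a rank-based (Spearman) correlation is also a reasonable cross-check on Pearson. The sketch below uses only the first ten rows from the str() output, so its numbers will not match the full-data correlations above; `wine_head` and `predictors` are illustrative names of my own.

```r
# First ten rows of three predictors and quality, copied from str()
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

predictors <- setdiff(names(wine_head), "quality")

# Correlate every predictor column with quality in one call each
pearson  <- sapply(wine_head[predictors], cor, y = wine_head$quality)
spearman <- sapply(wine_head[predictors], cor, y = wine_head$quality,
                   method = "spearman")

# Rank the predictors by strength of association
sort(abs(pearson), decreasing = TRUE)
```

On the full data frame, the same pattern would apply with Wine in place of wine_head.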
Let’s plot them against each other using RATING as a color and facet.
ggplot(data = Wine, aes(x = log10(sulphates), y = alcohol)) +
facet_wrap(~rating) +
geom_jitter(alpha=0.2)
ggplot(data = Wine, aes(x = volatile.acidity, y = alcohol)) +
facet_wrap(~rating) +
geom_point(alpha=0.2)
ggplot(data = Wine, aes(x = citric.acid, y = alcohol)) +
facet_wrap(~rating) +
geom_point(alpha=0.2)
ggplot(data = Wine, aes(x = volatile.acidity, y = log10(sulphates))) +
facet_wrap(~rating) +
geom_jitter(alpha=0.2)
ggplot(data = Wine, aes(x = citric.acid, y = log10(sulphates))) +
facet_wrap(~rating) +
geom_point(alpha=0.2)
ggplot(data = Wine, aes(x = citric.acid, y = volatile.acidity)) +
facet_wrap(~rating) +
geom_point(alpha=0.2)
The relative value of these scatterplots is suspect; if anything, they illustrate how heavily alcohol content affects rating. The weakest bivariate relationship appeared to be alcohol vs. citric acid, where the points were nearly uniformly distributed. The strongest relationship appeared to be volatile acidity vs. citric acid, which had a negative correlation.
Examining the acidity variables, I saw strong correlations between them:
ggplot(data = Wine, aes(x = fixed.acidity, y = citric.acid)) +
geom_point(alpha=0.3)
cor.test(Wine$fixed.acidity, Wine$citric.acid)
##
## Pearson's product-moment correlation
##
## data: Wine$fixed.acidity and Wine$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
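The t statistic and confidence interval in this output follow directly from the sample correlation and the sample size. As a sketch, they can be reproduced by hand for the fixed.acidity vs. citric.acid test; the variable names below are mine.

```r
r <- 0.6717034   # sample correlation reported above
n <- 1599        # number of wines
df <- n - 2      # degrees of freedom reported by cor.test

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))

t_stat
ci
```

With n this large, even a modest correlation produces a very large t value, so the p-value says little about how strong the relationship is; the estimate and its interval are the more informative numbers.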
ggplot(data = Wine, aes(x = volatile.acidity, y = citric.acid)) +
geom_point(alpha=0.3)
cor.test(Wine$volatile.acidity, Wine$citric.acid)
##
## Pearson's product-moment correlation
##
## data: Wine$volatile.acidity and Wine$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
ggplot(data = Wine, aes(x = log10(TAC.acidity), y = pH)) +
geom_point(alpha=0.3)
cor.test(log10(Wine$TAC.acidity), Wine$pH)
##
## Pearson's product-moment correlation
##
## data: log10(Wine$TAC.acidity) and Wine$pH
## t = -39.663, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7283140 -0.6788653
## sample estimates:
## cor
## -0.7044435
Most notably, the base 10 logarithm of TAC.acidity correlated well with pH (r ≈ -0.70). This is certainly expected, as pH is essentially a logarithmic measure of acidity. An interesting question to pose, using basic chemistry knowledge, is what components other than the measured acids are affecting pH. We can quantify this difference by building a linear model that predicts pH from TAC.acidity and capturing the relative difference between predicted and measured pH as a new variable.
m <- lm(I(pH) ~ I(log10(TAC.acidity)), data = Wine)
Wine$pH.predictions <- predict(m, Wine)
# relative error: (predicted - observed) / observed
Wine$pH.error <- (Wine$pH.predictions - Wine$pH)/Wine$pH
# factor() gives one box per quality score
ggplot(data = Wine, aes(x = factor(quality), y = pH.error)) +
geom_boxplot(outlier.colour = 'red') +
xlab('quality')
We can also check the model’s accuracy using the root-mean-square (RMS) error of its residuals.
rmse <- function(error)
{
sqrt(mean(error^2))
}
rmse(m$residuals)
## [1] 0.1095431
# Now, we train a Support Vector Machine.
library(e1071)
SVM <- svm(I(pH) ~ I(log10(TAC.acidity)), data = Wine)
Wine$pH.Predict.SVM <- predict(SVM,Wine)
Wine$pH.error.SVM <- (Wine$pH.Predict.SVM - Wine$pH)/Wine$pH
# factor() gives one box per quality score
ggplot(data = Wine, aes(x = factor(quality), y = pH.error.SVM)) +
geom_boxplot(outlier.colour = 'red') +
xlab('quality')
rmse(SVM$residuals)
## [1] 0.106785
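The first RMSE value (0.1095) is for the LM and the second (0.1068) is for the SVM, so the SVM fits slightly better than the LM. Both values are computed on the same data used for fitting, so the difference should be read with caution.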
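Since both RMSE values above are training errors, a hold-out split would give a fairer comparison between models. The following is a minimal sketch of that idea using lm() only and synthetic data; the simulated coefficients (4.3 and -1.0), the noise level, the 70/30 split, and the seed are all assumptions for illustration and are not estimates from the wine data.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Synthetic stand-in for the wine data (NOT the real measurements):
# pH falls linearly with log10 of total acidity, plus noise
n <- 500
acidity <- runif(n, min = 5, max = 16)
sim <- data.frame(
  log_acidity = log10(acidity),
  pH = 4.3 - 1.0 * log10(acidity) + rnorm(n, sd = 0.1)
)

rmse <- function(error) sqrt(mean(error^2))

# Hold out 30% of the rows for testing
test_idx <- sample(n, size = round(0.3 * n))
train <- sim[-test_idx, ]
test <- sim[test_idx, ]

# Fit on the training rows only
fit <- lm(pH ~ log_acidity, data = train)

train_rmse <- rmse(residuals(fit))
test_rmse <- rmse(test$pH - predict(fit, newdata = test))
```

The same split could be reused for the SVM, so that both models are scored on rows neither has seen.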
The median % error hovered at or near zero for most wine qualities. Notably, wines rated with a quality of 3 had a large negative error. We can interpret this finding by saying that for many of the ‘bad’ wines, total acidity from tartaric, acetic, and citric acids was a worse predictor of pH. Simply put, it is likely that there were other components, possibly impurities, that changed and affected the pH.
As annotated previously, I hypothesized that free.sulfur.dioxide and total.sulfur.dioxide were dependent on each other. Plotting this:
ggplot(data = Wine, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) +
geom_point(alpha=0.2) +
geom_smooth(se = FALSE)
cor.test(Wine$free.sulfur.dioxide, Wine$total.sulfur.dioxide)
##
## Pearson's product-moment correlation
##
## data: Wine$free.sulfur.dioxide and Wine$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
There is a fairly strong positive relationship between the two (r ≈ 0.67). Among the pairs tested here, it is comparable to fixed.acidity vs. citric.acid (r ≈ 0.67) and weaker only than log10(TAC.acidity) vs. pH (r ≈ -0.70). Additionally, as the variable names suggest, the clear ‘floor’ on this graph hints that free.sulfur.dioxide is a subset of total.sulfur.dioxide: no point appears to have a total below its free value.
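The subset hypothesis can be checked directly: if free sulfur dioxide is part of the total, it should never exceed it. The sketch below uses only the first ten rows from the str() output; `bound_so2` is a name I am introducing for the remainder and is not a column in the data set.

```r
# First ten rows of each column, copied from the str() output
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# If free SO2 is part of total SO2, free should never exceed total
free_within_total <- all(free_so2 <= total_so2)

# The remainder would be the non-free portion of the total
bound_so2 <- total_so2 - free_so2
```

On the full data, the same check would be all(Wine$free.sulfur.dioxide <= Wine$total.sulfur.dioxide).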
I primarily examined the 4 features which showed the highest correlation with quality. These scatterplots were a bit crowded, so I faceted by rating and by quality in the final plot to show the population differences between good, average, and bad wines more clearly. Higher citric acid and lower volatile (acetic) acid are associated with better wines. Likewise, better wines tended to have higher sulphates and alcohol content. Surprisingly, pH had very little visual impact on wine quality, and was overshadowed by the larger impact of alcohol. Interestingly, this shows that what makes a good wine depends on the type of acids that are present.
These subplots were created to demonstrate the effect of acidity and pH on wine quality. Generally, higher acidity (or lower pH) is seen in highly-rated wines. As a caveat, the presence of volatile (acetic) acid negatively affected wine quality. Citric acid had a stronger correlation with wine quality (r ≈ 0.23), while fixed (tartaric) acid had a smaller impact (r ≈ 0.12).
These boxplots demonstrate the effect of alcohol content on wine quality. Generally, higher alcohol content correlated with higher wine quality. However, as the outliers and intervals show, alcohol content alone did not guarantee a higher quality.
This is perhaps the most telling graph. I subsetted the data to remove the ‘average’ wines, i.e. any wine with a quality of 5 or 6. As the correlation tests show, wine quality was affected most strongly by alcohol and volatile acidity. While the boundaries are not as clear cut or modal, it’s apparent that high volatile acidity, with few exceptions, kept wine quality down. A combination of high alcohol content and low volatile acidity produced better wines.
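The subsetting step described here can be sketched as follows. The quality scores are rebuilt from the table() counts shown earlier so that the example runs on its own; `wines` and `extremes` are illustrative names, not objects from the original analysis.

```r
# Quality scores rebuilt from the table() counts shown earlier
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Same rating rule as used above
rating <- ordered(
  ifelse(quality < 5, 'bad', ifelse(quality < 7, 'average', 'good')),
  levels = c('bad', 'average', 'good')
)
wines <- data.frame(quality, rating)

# Keep only the 'bad' and 'good' wines, then drop the unused level
extremes <- droplevels(subset(wines, rating != 'average'))

table(extremes$rating)
```

Only 280 of the 1599 wines remain after this filter, so the pattern in the final plot rests on a fairly small share of the data.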
Through this exploratory data analysis, I was able to identify the factors most strongly associated with wine quality, mainly: alcohol content, sulphates, and acidity. It is important to note, however, that wine quality is ultimately a subjective measure, albeit one assessed by wine experts. The correlations are also modest; the largest, for alcohol, is about 0.48. Even so, the graphs adequately illustrate the factors that make good wines ‘good’ and bad wines ‘bad’. Further study with inferential statistics could be done to quantitatively confirm these assertions.